feat: Batch vote metrics (BFT-486) #157

aakoshh · 2024-07-23T14:09:27Z

What ❔

Adds metrics around the gossiping of batch votes and the storage of batches.

Added gossip::metrics::BATCH_VOTES_METRICS:

network_gossip_batch_votes_committee_size to show how many attesters we have, which can be correlated with the number of votes we receive over time. Wanted to add a weight based one as well, but not sure if counting the weights monotonically would not overflow the counter.
network_gossip_batch_votes_votes_added counts the votes received since start; with its rate-of-change we can get an idea of how many attesters are producing votes
network_gossip_batch_votes_weight_added counts the total weight of votes added since start, each normalized to [0; 1] range with the total committee weight; this way it shouldn't overflow, and its rate of growth should be 0.8-1.0 per batch.
network_gossip_batch_votes_min_batch_number shows the minimum batch we expect votes for; going up will indicate that QCs are formed
network_gossip_batch_votes_last_added_vote_batch_number is the number from the last vote we registered; apart from Byzantine votes it should average to go up as L1 batches are created and votes are received, even if a QC cannot be formed
network_gossip_batch_votes_last_signed_batch_number indicates that the current node is publishing its votes and keeping up with the L1 batches

Also implement metrics for the BatchStore similar to how the BlockStore has them:

next for queued and persisted
latencies for the PersistedBatchStore operations
latencies for the wait_for_... methods
zksync_consensus_storage_batch_store_last_persisted_batch_qc: last batch QC saved

The latencies for some of the BatchStore methods could potentially be less cumbersome if we used instrumentation.

Why ❔

To make the voting process observable.

node/actors/network/src/gossip/batch_votes.rs

RomanBrodetski · 2024-07-24T07:19:02Z

node/actors/network/src/gossip/metrics.rs

+    pub(crate) last_added_vote_batch_number: vise::Gauge<u64>,
+
+    /// Batch number of the last batch signed by this attester.
+    pub(crate) last_signed_batch_number: vise::Gauge<u64>,


can postpone this, but can we have a histogram that would show the distribution of votes delay after an l1 batch was produced?

Sounds interesting, I will create a ticket for this. There are a few challenges:

the timestamp of the batch isn't available here

currently we can receive votes earlier than we have the batch itself, so we would have to store a timestamp with the vote, and the retrospectively emit negative latencies when the batch is produced on the local node

We can have some specific workarounds:

For example when the last_signed_batch_number above changes it can act as an approximation of when the L1 batch was produced, but it's only an approximation because it depends on the polling frequency, and also it's only available on nodes which are attesters.

We could observe the time elapsed between the first and last vote. For attester nodes this would indicate the speed of propagation of votes over gossip, compared to the vote produced locally, although again the local vote might come later than the first gossiped one.

We could implement Grzegorz's idea about only gossiping votes to nodes which have indicated they have the batch already, which would deal with the negative values; this simplifies the metrics in that we don't have to buffer timestamps and observe them the first time we learn about the batch.

We could observe the difference between l1_batches.created_at and l1_batches_consensus.created_at when we insert the QC, which would be the overall latency of gossiping votes.

https://linear.app/matterlabs/issue/BFT-497/add-metric-for-l1-batch-vote-latency

aakoshh · 2024-07-25T10:39:02Z

Added the following metrics to the Consensus Dashboard on Grafana:

The queries behind them are the following:

max (network_gossip_batch_votes_last_signed_batch_number) by (pod) != 0: last vote published by attesters
max (zksync_consensus_storage_batch_store_last_persisted_batch_qc) by (pod): last batch QC saved by nodes
max (network_gossip_batch_votes_last_added_vote_batch_number) by (pod): last vote received and added by nodes
min (network_gossip_batch_votes_min_batch_number) by (pod): minimum accepted vote by node
max(rate(network_gossip_batch_votes_votes_added[10m]) * 600) by (pod): number of votes added per minute over the last 10 minutes
max(network_gossip_batch_votes_committee_size) by (pod): committee size to compare against number of votes added

BFT-486: Batch vote metrics

8a44e6d

aakoshh force-pushed the bft-486-batch-metrics branch from 928cf60 to 8a44e6d Compare July 23, 2024 14:10

aakoshh added 3 commits July 23, 2024 15:27

BFT-486: Move batch_store to subdir

ee62c6c

BFT-486: Latency metrics like the blockstore

53ccc31

BFT-486: Add last persisted batch QC

cc39361

aakoshh marked this pull request as ready for review July 23, 2024 15:06

aakoshh requested a review from pompon0 as a code owner July 23, 2024 15:06

pompon0 reviewed Jul 23, 2024

View reviewed changes

node/actors/network/src/gossip/batch_votes.rs Show resolved Hide resolved

BFT-486: Fix docstring about returning stats

c1933c2

pompon0 approved these changes Jul 23, 2024

View reviewed changes

aakoshh added 2 commits July 23, 2024 18:08

BFT-486: Add weight_added

1fef2fb

Merge branch 'main' into bft-486-batch-metrics

650aab4

RomanBrodetski reviewed Jul 24, 2024

View reviewed changes

RomanBrodetski approved these changes Jul 24, 2024

View reviewed changes

aakoshh merged commit 1075fc6 into main Jul 24, 2024
5 checks passed

aakoshh deleted the bft-486-batch-metrics branch July 24, 2024 08:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Batch vote metrics (BFT-486) #157

feat: Batch vote metrics (BFT-486) #157

aakoshh commented Jul 23, 2024 •

edited

Loading

RomanBrodetski Jul 24, 2024

aakoshh Jul 24, 2024

aakoshh Jul 24, 2024

aakoshh commented Jul 25, 2024 •

edited

Loading

feat: Batch vote metrics (BFT-486) #157

feat: Batch vote metrics (BFT-486) #157

Conversation

aakoshh commented Jul 23, 2024 • edited Loading

What ❔

Why ❔

RomanBrodetski Jul 24, 2024

Choose a reason for hiding this comment

aakoshh Jul 24, 2024

Choose a reason for hiding this comment

aakoshh Jul 24, 2024

Choose a reason for hiding this comment

aakoshh commented Jul 25, 2024 • edited Loading

aakoshh commented Jul 23, 2024 •

edited

Loading

aakoshh commented Jul 25, 2024 •

edited

Loading